53 research outputs found

    PHACTS about activation-based word similarity effects

    English phonotactic learning is modeled by means of the PHACTS algorithm, a topological neuronal receptive field implementing a phonotactic activation function aimed at capturing both local (i.e., phonemic) and global (i.e., word-level) similarities among strings. Limits and merits of the model are presented.

    L'emergenza del Paradigma. Un modello di apprendimento non-supervisionato applicato al sistema verbale dell'italiano

    The primary aim of the present work is an investigation into the emergent nature of the associative schemas that, in theory, organize a speaker's mental lexicon. From this perspective, we carry out an experimental application of Self-Organizing Maps (SOMs) to Italian verb inflection data, treated as input to the neural device. A SOM, as will be specified in more detail below, is (in our case) a connectionist structure capable of organizing morphological input data into linguistic categories. The organization is unsupervised, in the sense that no external supervisor tells the system how the categorial restructuring should proceed. In such neural devices, learning the input data coincides with producing a spatial configuration of those data on a surface (map), generally two-dimensional. In this way a SOM can detect relevant similarities among the features describing the input data and, at the same time, categorize the data on the basis of the relevance it discovers. Categorial relevance is obtained through principles of statistical salience. A simulation of morphological learning with SOMs thus aims to reproduce a possible self-organization schema of the data to which the network is exposed during training, one that is emergent and functional for organization and restructuring. This schema is represented visually as a topological distribution of the input data over the map, according to some emergent criterion of linguistic relevance identified incrementally by the SOM.
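    The self-organizing map procedure described in this abstract can be illustrated with a minimal sketch. The grid size, learning-rate and neighbourhood schedules, and the assumption that each verb form has already been encoded as a fixed-length numeric feature vector are illustrative choices, not the configuration used in the study.

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal self-organizing map: each grid cell holds a weight vector;
    the best-matching unit (BMU) and its neighbours are pulled toward
    every input, so similar inputs end up in nearby cells."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    dim = data.shape[1]
    weights = rng.random((rows, cols, dim))
    # Precompute grid coordinates for the neighbourhood function.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Locate the best-matching unit (smallest Euclidean distance).
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(dists.argmin(), dists.shape)
            # Decay learning rate and neighbourhood radius over time.
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 1e-3
            # Gaussian neighbourhood centred on the BMU.
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
            h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

def map_item(weights, x):
    """Return the grid cell an input is assigned to after training."""
    dists = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(dists.argmin(), dists.shape)
```

    After training, inputs that share features should land in the same or neighbouring cells, which is the kind of emergent topological organization the abstract describes.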

    Morfologia e DSS (Denoising Source Separation)


    From GLÀFF to PsychoGLÀFF: a large psycholinguistics-oriented French lexical resource

    In this paper, we present two French lexical resources, GLÀFF and PsychoGLÀFF. The former, automatically extracted from the collaborative online dictionary Wiktionary, is a large-scale versatile lexicon exploitable in Natural Language Processing applications and linguistic studies. The latter, based on GLÀFF, is a lexicon specifically designed for psycholinguistic research. GLÀFF, counting more than 1.4 million entries, is of unprecedented size. It reports lemmas, main syntactic categories, inflectional features and phonemic transcriptions. PsychoGLÀFF contains additional information related to formal aspects of the lexicon and its distribution. It contains about 340,000 corpus-attested entries (120,000 lemmas). We explain how the resources have been created and compare them to other known resources in terms of coverage and quality. Regarding PsychoGLÀFF, the comparison shows that it has an exceptionally large repertoire while being of comparable quality.
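    As a rough illustration of how a resource of this kind might be consumed, the sketch below loads a tab-separated lexicon and keeps only corpus-attested entries above a frequency threshold. The column layout, field names and file name are hypothetical and are not the actual PsychoGLÀFF format.

```python
import csv
from collections import namedtuple

# Hypothetical column layout; the real resource's schema may differ.
Entry = namedtuple("Entry", "form lemma pos phonemes frequency")

def load_lexicon(path, min_freq=1.0):
    """Read a tab-separated lexicon and keep corpus-attested entries
    whose frequency is at least min_freq (assumed per million words)."""
    entries = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            form, lemma, pos, phonemes, freq = row[:5]
            freq = float(freq or 0.0)
            if freq >= min_freq:
                entries.append(Entry(form, lemma, pos, phonemes, freq))
    return entries

# Example: index entries by lemma for psycholinguistic item selection.
# lexicon = load_lexicon("psychoglaff.tsv", min_freq=0.5)
# by_lemma = {}
# for e in lexicon:
#     by_lemma.setdefault(e.lemma, []).append(e)
```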

    GLÀFF, un Gros Lexique À tout Faire du Français

    This paper introduces GLÀFF, a large-scale versatile French lexicon extracted from Wiktionary, the collaborative online dictionary. GLÀFF contains, for each entry, a morphosyntactic description and a phonetic transcription. It distinguishes itself from the other available lexicons mainly by its size, its potential for constant updating and its copylefted license that makes it available for use, modification and redistribution. We explain how we have built GLÀFF and compare it to other known resources. We show that its size and quality are strong assets that could allow GLÀFF to become a reference lexicon for NLP, linguistics and psycholinguistics.
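    A very small sketch of the kind of extraction involved is given below: it streams a Wiktionary XML dump and pulls page titles together with a pronunciation template found in the wikitext. The dump schema version, the namespace filtering and the {{pron}} pattern are simplifying assumptions; the actual GLÀFF pipeline is considerably more elaborate.

```python
import re
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # assumed dump schema version
PRON = re.compile(r"\{\{pron\|([^|}]+)\|fr\}\}")     # French {{pron|...|fr}} template

def iter_pronunciations(dump_path):
    """Stream a Wiktionary XML dump and yield (title, pronunciation) pairs
    for pages whose wikitext contains a French pronunciation template."""
    for _, elem in ET.iterparse(dump_path, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title", default="")
            text = elem.findtext(f"{NS}revision/{NS}text", default="") or ""
            m = PRON.search(text)
            if m and ":" not in title:   # skip non-article namespaces
                yield title, m.group(1)
            elem.clear()                 # keep memory bounded while streaming

# Example:
# for word, ipa in iter_pronunciations("frwiktionary-latest-pages-articles.xml"):
#     print(word, ipa)
```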

    Acquisition and enrichment of morphological and morphosemantic knowledge from the French Wiktionary

    We present two approaches to automatically acquire morphologically related words from Wiktionary. Starting with related words explicitly mentioned in the dictionary, we propose a method based on orthographic similarity to detect new derived words from the entries' definitions, with an overall accuracy of 93.5%. Using word pairs from the initial lexicon as patterns of formal analogies to filter the new derived words raises the accuracy to 99%, while extending the lexicon's size by 56%. In a last experiment, we show that it is possible to semantically type the morphological definitions, focusing on the detection of process nominals.
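    The analogy-based filtering step can be sketched as follows: a candidate (base, derived) pair is kept only if some known related pair exhibits the same formal alternation, here approximated by suffix substitution after the longest common prefix. The helper names and the purely suffixal notion of analogy are simplifications introduced for the example.

```python
import os

def alternation(a, b):
    """Describe the suffix alternation turning a into b, e.g.
    ('décorer', 'décoration') -> ('er', 'ation'), using the longest
    common prefix as the shared stem."""
    stem_len = len(os.path.commonprefix([a, b]))
    return a[stem_len:], b[stem_len:]

def is_analogous(known_pair, candidate_pair):
    """True if the candidate pair shows the same suffix alternation as a
    known related pair (a rough formal analogy a : b :: c : d restricted
    to suffixation)."""
    return alternation(*known_pair) == alternation(*candidate_pair)

def filter_candidates(known_pairs, candidates):
    """Keep candidate (base, derived) pairs supported by at least one
    known pair with the same alternation pattern."""
    patterns = {alternation(a, b) for a, b in known_pairs}
    return [(c, d) for c, d in candidates if alternation(c, d) in patterns]

# Example (hypothetical data):
# known = [("décorer", "décoration"), ("créer", "création")]
# candidates = [("former", "formation"), ("maison", "maisonner")]
# print(filter_candidates(known, candidates))   # keeps ('former', 'formation')
```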

    Ne jetons pas le Wiktionnaire avec l'oripeau du Web ! Études et réalisations fondées sur le dictionnaire collaboratif

    Wiktionnaire is the French edition of Wiktionary, the free multilingual dictionary available online. A satellite of Wikipedia, of which it is the "lexical companion", the dictionary project remains in the encyclopedia's shadow. Built, like Wikipedia, on the wiki principle, it can be fed and edited by any internet user, with immediate publication. While the encyclopedic resource has been used extensively in some disciplines, the collaborative dictionary seems to have received less attention from the scientific community. This lesser interest may stem from a lack of familiarity or from an a priori rejection of the amateurism readily associated with contributions made by non-experts. In this article we present some characteristics of Wiktionnaire, as well as resources built from it. This work aims to illustrate the possibilities offered by this singular dictionary and to help decide whether it is worth exploiting, and for what purpose. More precisely, we question the legitimacy of crowd-sourced resources and examine to what extent Wiktionnaire can, thanks to its specific features, complement existing dictionary resources for linguistic studies and, on the other hand, serve as a starting point for building an electronic lexicon for fields such as natural language processing and psycholinguistics. Our contribution to the characterization of Wiktionnaire comes with the release of two lexicons built from the collaborative dictionary. The first is a very-large-coverage morphophonological lexicon, intended in particular for NLP applications; we give possible examples of its use in tool-assisted linguistic studies. The second is a lexicon oriented toward psycholinguistics. Derived from the first, it contains fewer entries, but includes for each of them a set of information usually used in that discipline. Both lexicons can be downloaded and queried online.

    Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce.

    We describe here the technical details of our participation in PAN 2012's "traditional" authorship attribution tasks. The main originality of our approach lies in the use of a large quantity of varied features to represent textual data, processed by a maximum-entropy machine learning tool. Most of these features make intensive use of natural language processing annotation techniques as well as generic language resources such as lexicons and other linguistic databases. Some of the features were even designed specifically for the target data type (contemporary fiction). Our belief is that richer features, which integrate external knowledge about language, have an advantage over knowledge-poorer ones (such as word and character n-gram frequencies) when training data is scarce (both in raw volume and in the number of training items for each target author). Although overall results were average (66% accuracy over the main tasks for the best run), we focus in this paper on the differences between feature sets. While the "rich" linguistic features proved better than character trigrams and word frequencies, the most efficient features vary widely from task to task. For the intrusive-paragraph tasks, we obtained better results (73% and 93%), while still using the maximum entropy engine as an unsupervised clustering tool.
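    As a rough sketch of the knowledge-poor baseline mentioned in this abstract, the snippet below trains a maximum-entropy (multinomial logistic regression) classifier over character trigram frequencies. scikit-learn is used purely as an illustrative stand-in; nothing about the actual PAN 2012 feature set or tooling is implied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_attributor():
    """Character-trigram features fed to a maximum-entropy classifier
    (logistic regression); the 'rich' features discussed in the paper
    would be added alongside or instead of these n-grams."""
    return make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=True),
        LogisticRegression(max_iter=1000),
    )

# Example with toy data (hypothetical texts and author labels):
# texts = ["text by author A ...", "text by author B ..."]
# authors = ["A", "B"]
# model = build_attributor()
# model.fit(texts, authors)
# print(model.predict(["an unseen document"]))
```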

    Climbing the syllable peaks

    • …